Tidyverse

Data Manipulation with `dplyr`

`dplyr` makes data manipulation easier by providing verbs named after the operations they perform. The most common verbs are:
filter(): Keep rows that satisfy one or more conditions.
select(): Choose which columns to keep.
mutate(): Add new columns or modify existing ones.
arrange(): Sort rows by one or more columns.
summarise() (also spelled summarize()): Reduce multiple values to a single summary.
group_by(): Group rows so that later verbs operate within each group.

`dplyr` verbs

The examples below use the diamonds dataset from ggplot2, which contains the prices and other attributes of almost 54,000 diamonds (see ?diamonds).

# Inspect the diamonds dataset (it ships with ggplot2)
library(ggplot2)
class(diamonds)

[1] "tbl_df"     "tbl"        "data.frame"

str(diamonds)

tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

# glimpse() from dplyr gives a similar, more compact overview
library(dplyr)
glimpse(diamonds)

Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

`select()`

diamonds %>% select(carat)

# A tibble: 53,940 × 1
   carat
   <dbl>
 1  0.23
 2  0.21
 3  0.23
 4  0.29
 5  0.31
 6  0.24
 7  0.24
 8  0.26
 9  0.22
10  0.23
# ℹ 53,930 more rows

Selecting more than one column:

diamonds %>% select(carat, cut, color, price)

# A tibble: 53,940 × 4
   carat cut       color price
   <dbl> <ord>     <ord> <int>
 1  0.23 Ideal     E       326
 2  0.21 Premium   E       326
 3  0.23 Good      E       327
 4  0.29 Premium   I       334
 5  0.31 Good      J       335
 6  0.24 Very Good J       336
 7  0.24 Very Good I       336
 8  0.26 Very Good H       337
 9  0.22 Fair      E       337
10  0.23 Very Good H       338
# ℹ 53,930 more rows

Alternative ways to select columns

# carat:color selects every column from carat through color
diamonds %>% select(carat:color, price)

# A tibble: 53,940 × 4
   carat cut       color price
   <dbl> <ord>     <ord> <int>
 1  0.23 Ideal     E       326
 2  0.21 Premium   E       326
 3  0.23 Good      E       327
 4  0.29 Premium   I       334
 5  0.31 Good      J       335
 6  0.24 Very Good J       336
 7  0.24 Very Good I       336
 8  0.26 Very Good H       337
 9  0.22 Fair      E       337
10  0.23 Very Good H       338
# ℹ 53,930 more rows

# Select columns whose names start with "c"
diamonds %>% select(starts_with("c"))

# A tibble: 53,940 × 4
   carat cut       color clarity
   <dbl> <ord>     <ord> <ord>  
 1  0.23 Ideal     E     SI2    
 2  0.21 Premium   E     SI1    
 3  0.23 Good      E     VS1    
 4  0.29 Premium   I     VS2    
 5  0.31 Good      J     SI2    
 6  0.24 Very Good J     VVS2   
 7  0.24 Very Good I     VVS1   
 8  0.26 Very Good H     SI1    
 9  0.22 Fair      E     VS2    
10  0.23 Very Good H     VS1    
# ℹ 53,930 more rows

# Select all columns except carat
diamonds %>% select(-carat)

# A tibble: 53,940 × 9
   cut       color clarity depth table price     x     y     z
   <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Now try it yourself: select all columns except those whose names start with "c".

diamonds %>% select(-starts_with("c"))

# A tibble: 53,940 × 6
   depth table price     x     y     z
   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  61.5    55   326  3.95  3.98  2.43
 2  59.8    61   326  3.89  3.84  2.31
 3  56.9    65   327  4.05  4.07  2.31
 4  62.4    58   334  4.2   4.23  2.63
 5  63.3    58   335  4.34  4.35  2.75
 6  62.8    57   336  3.94  3.96  2.48
 7  62.3    57   336  3.95  3.98  2.47
 8  61.9    55   337  4.07  4.11  2.53
 9  65.1    61   337  3.87  3.78  2.49
10  59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

`filter()`

diamonds %>% filter(cut == "Premium")

# A tibble: 13,791 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
 2  0.29 Premium I     VS2      62.4    58   334  4.2   4.23  2.63
 3  0.22 Premium F     SI1      60.4    61   342  3.88  3.84  2.33
 4  0.2  Premium E     SI2      60.2    62   345  3.79  3.75  2.27
 5  0.32 Premium E     I1       60.9    58   345  4.38  4.42  2.68
 6  0.24 Premium I     VS1      62.5    57   355  3.97  3.94  2.47
 7  0.29 Premium F     SI1      62.4    58   403  4.24  4.26  2.65
 8  0.22 Premium E     VS2      61.6    58   404  3.93  3.89  2.41
 9  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
10  0.3  Premium J     SI2      59.3    61   405  4.43  4.38  2.61
# ℹ 13,781 more rows

# Combine conditions with `&` or by separating them with commas
diamonds %>% filter(cut == "Premium" & color == "D")

# A tibble: 1,603 × 10
   carat cut     color clarity depth table price     x     y     z
   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Premium D     VS2      59.3    62   404  3.91  3.88  2.31
 2  0.3  Premium D     SI1      62.6    59   552  4.23  4.27  2.66
 3  0.71 Premium D     SI2      61.7    59  2768  5.71  5.67  3.51
 4  0.71 Premium D     VS2      62.5    60  2770  5.65  5.61  3.52
 5  0.7  Premium D     VS2      58      62  2773  5.87  5.78  3.38
 6  0.72 Premium D     SI1      62.7    59  2782  5.73  5.69  3.58
 7  0.7  Premium D     SI1      62.8    60  2782  5.68  5.66  3.56
 8  0.72 Premium D     SI2      62      60  2795  5.73  5.69  3.54
 9  0.71 Premium D     SI1      62.7    60  2797  5.67  5.71  3.57
10  0.71 Premium D     SI1      61.3    58  2797  5.73  5.75  3.52
# ℹ 1,593 more rows

# The result of a filter can be saved in a new data frame
myselection <- diamonds %>% filter(between(price, 500, 600))
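The comma form, the `|` (OR) operator, and `%in%` are not shown above, so here is a minimal sketch of how they might be used on the same dataset. The object names are hypothetical and chosen only for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Comma-separated conditions are combined with AND, just like `&`
with_and   <- diamonds %>% filter(cut == "Premium" & color == "D")
with_comma <- diamonds %>% filter(cut == "Premium", color == "D")
nrow(with_and) == nrow(with_comma)

# `|` keeps rows that match at least one of the conditions
premium_or_d <- diamonds %>% filter(cut == "Premium" | color == "D")

# %in% tests membership in a set of values
top_cuts <- diamonds %>% filter(cut %in% c("Premium", "Ideal"))

# between() includes both endpoints, so these two filters keep the same rows
in_range  <- diamonds %>% filter(between(price, 500, 600))
in_range2 <- diamonds %>% filter(price >= 500, price <= 600)
```

The comma form tends to read better when there are many conditions, while `&` and `|` are needed as soon as AND and OR are mixed.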

`summarise()`

# Mean and median of price

diamonds %>% 
  summarise(mean(price),
            median(price))

# A tibble: 1 × 2
  `mean(price)` `median(price)`
          <dbl>           <dbl>
1         3933.            2401

# Number, proportion, and percentage of diamonds priced above $15,000
diamonds %>% 
  summarise(veryexp = sum(price > 15000),
            veryexpprop = mean(price > 15000),
            veryexpperc = mean(price > 15000) * 100)

# A tibble: 1 × 3
  veryexp veryexpprop veryexpperc
    <int>       <dbl>       <dbl>
1    1655      0.0307        3.07
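Unnamed summaries produce column names such as `mean(price)`, which are awkward to refer to later. A small sketch of the named form follows; the column names are arbitrary choices for this example, and n() is the dplyr helper that counts rows.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Naming each summary gives clean column names in the result
price_summary <- diamonds %>%
  summarise(mean_price   = mean(price),
            median_price = median(price),
            sd_price     = sd(price),
            n            = n())   # number of rows summarised

price_summary
```

The named columns can then be used directly in later steps, for example `price_summary$mean_price`.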

`group_by()`

group_by() is used to group rows of a data frame by one or more columns.
This helps in performing operations like summarizing or aggregating data by categories.

diamonds %>% 
  group_by(cut,color) %>% 
  summarise(mean(price))

# A tibble: 35 × 3
# Groups:   cut [5]
   cut   color `mean(price)`
   <ord> <ord>         <dbl>
 1 Fair  D             4291.
 2 Fair  E             3682.
 3 Fair  F             3827.
 4 Fair  G             4239.
 5 Fair  H             5136.
 6 Fair  I             4685.
 7 Fair  J             4976.
 8 Good  D             3405.
 9 Good  E             3424.
10 Good  F             3496.
# ℹ 25 more rows
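Note the `# Groups: cut [5]` line in the output: by default, summarise() removes only the last grouping variable, so the result is still grouped by cut. One way to get an ungrouped result is the `.groups` argument, sketched here assuming dplyr 1.0.0 or later; the column names are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Named summaries per cut/color combination, returned without grouping
by_cut_color <- diamonds %>%
  group_by(cut, color) %>%
  summarise(mean_price = mean(price),
            n          = n(),
            .groups    = "drop")

# Check whether the result still carries grouping information
is_grouped_df(by_cut_color)
```

Dropping the groups avoids surprises when later verbs, such as mutate() or arrange(), would otherwise work within each cut.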

`mutate()`

mutate() creates new columns or modifies existing columns in a data frame.

# Create a new column that is "Yes" if the price is below $1000 and "No" otherwise
newdiamonds <- diamonds %>% 
  mutate(newcol = ifelse(price < 1000, "Yes", "No"))

# Derive the frequency distribution of 'newcol' along with percentages.
newdiamonds %>% 
  count(newcol) %>% 
  mutate(perc = n / nrow(newdiamonds) * 100)

# A tibble: 2 × 3
  newcol     n  perc
  <chr>  <int> <dbl>
1 No     39441  73.1
2 Yes    14499  26.9
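When more than two categories are needed, nested ifelse() calls get hard to read. A possible alternative is dplyr's case_when(), sketched below. The three price bands and their cutoffs ($1000 and $15,000) are assumptions made for this illustration, reusing thresholds that appear elsewhere in this section.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Conditions are checked top to bottom; the first match wins
banded <- diamonds %>%
  mutate(price_band = case_when(
    price < 1000   ~ "low",
    price <= 15000 ~ "mid",
    TRUE           ~ "high"   # fallback for everything else
  ))

# Frequency distribution of the new column with percentages
band_counts <- banded %>%
  count(price_band) %>%
  mutate(perc = n / sum(n) * 100)

band_counts
```

Using `sum(n)` inside mutate() avoids referring back to the original data frame by name.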

`arrange()`

# Sort diamonds by price in ascending order; tail() then shows the six most expensive
diamonds %>% 
  arrange(price) %>% 
  tail()

# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
2  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
3  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
4  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
5  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
6  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16

# Sort by price in descending order
diamonds %>% 
  arrange(desc(price))

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows
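arrange() also accepts several sort keys, and slice_max() offers a shortcut for "the top n rows". The sketch below assumes dplyr 1.0.0 or later for slice_max(); the object names are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Sort by cut first (in factor-level order), then by price within each cut
by_cut_then_price <- diamonds %>%
  arrange(cut, desc(price))

# The six most expensive diamonds without sorting the whole table by hand;
# with_ties = FALSE guarantees exactly six rows even if prices repeat
top6 <- diamonds %>%
  slice_max(price, n = 6, with_ties = FALSE)

top6
```

Because cut is an ordered factor, arrange() follows its level order (Fair first) rather than alphabetical order.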

Example with dplyr verbs combined

Using mtcars, keep the rows where mpg > 20, select specific columns, create a new column, and sort the result by mpg in descending order.

library(dplyr)

mtcars %>% 
  filter(mpg > 20) %>%
  select(mpg, wt, hp) %>% 
  mutate(mpg_per_weight = mpg / wt) %>% 
  arrange(desc(mpg))

                mpg    wt  hp mpg_per_weight
Toyota Corolla 33.9 1.835  65      18.474114
Fiat 128       32.4 2.200  66      14.727273
Honda Civic    30.4 1.615  52      18.823529
Lotus Europa   30.4 1.513 113      20.092531
Fiat X1-9      27.3 1.935  66      14.108527
Porsche 914-2  26.0 2.140  91      12.149533
Merc 240D      24.4 3.190  62       7.648903
Datsun 710     22.8 2.320  93       9.827586
Merc 230       22.8 3.150  95       7.238095
Toyota Corona  21.5 2.465  97       8.722110
Hornet 4 Drive 21.4 3.215 110       6.656299
Volvo 142E     21.4 2.780 109       7.697842
Mazda RX4      21.0 2.620 110       8.015267
Mazda RX4 Wag  21.0 2.875 110       7.304348
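The mtcars pipeline does not use group_by() or summarise(). As a complementary sketch, the same style of chaining can combine all six verbs on the diamonds dataset. The 1-carat threshold and the column names are arbitrary choices for this illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

large_by_cut <- diamonds %>%
  filter(carat >= 1) %>%                          # keep diamonds of 1 carat or more
  select(cut, carat, price) %>%                   # keep only the columns needed
  mutate(price_per_carat = price / carat) %>%     # derive a new column
  group_by(cut) %>%                               # one group per cut
  summarise(n = n(),
            mean_price_per_carat = mean(price_per_carat),
            .groups = "drop") %>%
  arrange(desc(mean_price_per_carat))             # highest average first

large_by_cut
```

Each step receives the output of the previous one, so the pipeline reads top to bottom in the order the operations are applied.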
